Abstract — The main principle of stacked generalization (or Stacking) 
is using a second-level generalizer to combine the outputs of base 
classifiers in an ensemble. In this paper, we investigate different combi- 
nation types under the stacking framework; namely weighted sum (WS), 
class-dependent weighted sum (CWS) and linear stacked generalization 
(LSG). For learning the weights, we propose using regularized empirical 
risk minimization with the hinge loss. In addition, we propose using 
group sparsity for regularization to facilitate classifier selection. We per- 
formed experiments using two different ensemble setups with differing 
diversities on 8 real-world datasets. Results show the power of regular- 
ized learning with the hinge loss function. Using sparse regularization, 
we are able to reduce the number of selected classifiers of the diverse 
ensemble without sacrificing accuracy. With the non-diverse ensembles, 
we even gain accuracy on average by using sparse regularization. 

Index Terms — classifier combination, classifier selection, regularized 
empirical risk minimization, hinge loss, group sparsity 



1 Introduction 

Classifier ensembles aim to increase the accuracy of classifier
systems at the expense of increased complexity, and they have been
shown to outperform single-expert systems in a broad range of
applications. Among all the theoretical and practical reasons to
prefer ensembles, which are categorized as statistical, computational
and representational in 0], the most important ones are the
statistical reasons. Since we are interested in generalization
performance (error on the test data) in pattern recognition problems,
it is often very difficult to find the "perfect classifier", but
combining multiple classifiers increases the probability of getting
closer to it. An ensemble may not always beat the best single
classifier obtained, but it will surely decrease the variance of the
classification error. Some other reasons besides the statistical ones
can be found in EJ, 0.

The straightforward way to obtain an ensemble is to use different
classifier types or different parameters. Training the base
classifiers on different subsets or samplings of the data or features
is also used and results in more diverse ensembles. There are
different measures of the diversity of an ensemble, but diversity
simply means that the base classifiers make errors on different
examples. Diverse ensembles result in better performance with a
reasonable combiner. In this work, we are not interested in the
methods of obtaining the ensemble; instead, we investigate various
linear combination types for a given set of base classifiers.

Base classifiers produce either label outputs or continuous valued
outputs. For the former, combiners like majority voting or weighted
majority voting are used. In the latter case, base classifiers
produce continuous scores for each class that represent the degree of
support for that class. These scores can be interpreted as confidences
in the suggested labels or as estimates of the posterior probabilities
of the classes [3]. The former interpretation is more reasonable:
for most classifier types the support values may not be close to the
actual posterior probabilities even if the data is dense, because
classifiers generally do not try to estimate the posterior
probabilities; they try to classify the data instances correctly, so
they usually only try to force the true class' score to be the
maximum. In this paper, we deal with the combination of continuous
valued outputs.

Combination rules can be grouped into trainable vs. non-trainable
(or supervised vs. unsupervised). The simple average (mean), product,
trimmed mean, minimum, maximum and median rules are some examples of
non-trainable combiners. Learning the combiner from training data has
been shown to give better accuracy than non-trainable combiners. Among
trainable combiners, such as stacked generalization (Stacking) [4],
Decision Templates [3] and Dempster-Shafer Combination [5], stacked
generalization is deeply investigated and analyzed in the 
literature ffl, 0, Q, ®, 0, El El, d, El, El, El. 
The idea of Stacking is to use the confidence scores obtained from
the base classifiers as attributes in a new training set with the
original class labels, and to train a meta-classifier (called the
level-1 generalizer in [4]) on this new dataset. Linear
meta-classifiers are usually preferred in the literature because of
their speed and complexity advantage over nonlinear ones. When
initially introduced, stacking was used to combine the class
predictions of the base classifiers [4]. Ting & Witten used the
confidence scores of the base classifiers as input features and
improved stacking's performance [6], [8]. Merz used stacking and
correspondence analysis to model the relationship between the
learning examples and their classification by a collection of learned
models, and used a nearest neighbor classifier as the meta-learner.
Dzeroski & Zenko used multi-response model trees as the meta-learner
[11]. Seewald introduced StackingC, which improves Stacking's
performance further and reduces the computational cost by introducing
class-conscious combination [9]. Sill incorporated meta-features with
the posterior scores of base classifiers to improve accuracy [12].
Ledezma used genetic algorithms to search for good Stacking
configurations [15]. Tang re-ranked all possible class labels
according to the scores and obtained a learner which outperforms all
base classifiers [16].

Since training the base classifiers and the combiner with the same
data samples results in overfitting, a sophisticated
cross-validation is applied to obtain the training data of the
combiner (level-1 data). This procedure, called internal
cross-validation, is described in Section 2. After obtaining the
level-1 data, two main problems remain for a linear combination:
(1) Which type of combination method should be used? (2) Given a
combination type, how should we learn the parameters of the combiner?
For the former problem, Ueda [17] defined three linear combination
types, namely type-1, type-2 and type-3. In this paper, we use the
descriptive names weighted sum (WS), class-dependent weighted sum
(CWS) and linear stacked generalization (LSG) for these types of
combinations, respectively, and investigate all of them. In J7|, El,
LSG is used and CWS combination is proposed in [6]. For the latter
problem, Ting & Witten proposed a multi-response linear regression
algorithm for learning the weights. Ueda [17] proposed using the
minimum classification error (MCE) criterion for estimating the
optimal weights, which increased the accuracies. The MCE criterion is
an approximation to the zero-one loss function, which is not convex,
so finding a global optimizer is not always possible. Ueda derived
algorithms for different types of combinations with the MCE loss
using stochastic gradient methods. Both of these studies ignored
regularization, which has a large effect on the performance,
especially if the number of base classifiers is large. Reid & Grudic
[13] regularized the standard linear least squares estimation of the
weights with CWS and improved the performance of stacking. They
applied the $\ell_2$ norm penalty, the $\ell_1$ norm penalty and a
combination of the two (elastic net regression). In this work, we
propose maximum margin algorithms for learning the optimal weights.
We work with the regularized empirical risk minimization framework
[18] and use the hinge loss function with $\ell_2$ regularization,
which corresponds to the support vector machine (SVM). We do not
derive algorithms for the solutions of the minimization problems, but
state-of-the-art SVM solvers in the literature can be modified for
our problem.



Another issue, recently addressed in [19], is combination with a
sparse weight vector so that not all of the ensemble is used. Since
classifiers that have zero weight do not have to be evaluated in the
test phase, the overall test time is much shorter. Zhang formulated
this problem as a linear programming problem for only the WS
combination type [19]. Reid used $\ell_1$ norm regularization for CWS
combination [13]. In this paper, we investigate sparsity issues for
all three combination types: WS, CWS and LSG. We use both the
$\ell_1$ norm and the $\ell_1$-$\ell_2$ norm for regularization in the
objective function for CWS and LSG. The latter regularization results
in group sparsity, which has been deeply investigated and
successfully applied to various problems recently.

Throughout the paper, we use m for the classifier subscript, n for
the class subscript and i for the data instance subscript; M, N and I
denote the number of classifiers, classes and data instances,
respectively. The data point subscript i is sometimes dropped for
simplicity. In Section 2, we explain the cross-validation technique
used in stacked generalization. In Section 3, we define the classifier
combination problem formally and define three different combination
types used in the literature, namely WS, CWS and LSG. In Section 4, we
explain how the weights are learned using the regularized empirical
risk minimization framework with the hinge loss and a regularization
function. In Section 5, we define sparse regularization functions to
enforce classifier selection. In Section 6, we describe the
experimental setups. In Section 7, we show the results of our
experiments and discuss them.



2 Stacked Generalization 

Stacked generalization, or stacking, was introduced in 1992 [4]. The
basic idea is to apply a meta-level (or level-1) generalizer to the
outputs of the base classifiers (or level-0 classifiers). For training
the level-1 classifier, we need the confidence scores (level-1 data)
of the training data, but training the combiner with the same data
instances that are used for training the base classifiers leads to
overfitting the database and eventually results in poor generalization
performance. Stacking deals with this problem by a sophisticated
cross-validation method (internal CV), in which the training data is
divided into k parts and each part is scored by base classifiers that
are trained on the other k - 1 parts. In the end, each training
instance's score is obtained from base classifiers whose training data
does not contain that particular instance. This procedure is repeated
for each base classifier in the ensemble. We apply this procedure for
the three different linear combination types. Wolpert combined only
the class predictions with this framework; Ting & Witten improved the
performance of stacking by combining continuous valued predictions
0. 
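
The internal cross-validation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the nearest-mean base classifier, the function names (`nearest_mean_fit`, `nearest_mean_scores`, `level1_data`) and the deterministic fold assignment `i % k` are all hypothetical choices made for the example.

```python
def nearest_mean_fit(X, y, n_classes):
    """Toy base classifier (assumption): store the mean vector of each class."""
    dim = len(X[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, label in zip(X, y):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += x[d]
    # A class missing from a fold keeps a zero mean (acceptable for a sketch).
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def nearest_mean_scores(means, x):
    """Score of class n = negative squared distance to the mean of class n."""
    return [-sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in means]


def level1_data(X, y, n_classes, k, fit, scores):
    """Internal CV: score every instance with a model that never saw it."""
    n = len(X)
    out = [None] * n
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        model = fit([X[i] for i in train_idx],
                    [y[i] for i in train_idx], n_classes)
        for i in test_idx:
            out[i] = scores(model, X[i])
    return out


X = [[0.0], [0.2], [0.1], [1.0], [1.2], [1.1], [0.3], [0.9]]
y = [0, 0, 0, 1, 1, 1, 0, 1]
P = level1_data(X, y, n_classes=2, k=4,
                fit=nearest_mean_fit, scores=nearest_mean_scores)
```

With several base classifiers, the same loop would be run once per classifier and the resulting score vectors concatenated per instance to form the level-1 training set.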
3 Combination Types 

3.1 Problem Formulation 

In the classifier combination problem, the input to the combiner
consists of the posterior scores belonging to different classes
obtained from the base classifiers. Let $p_m^n$ be the posterior score
of class n obtained from classifier m for any data instance. Let
$\mathbf{p}_m = [p_m^1, p_m^2, \ldots, p_m^N]^T$; then the input to the
combiner is $\mathbf{f} = [\mathbf{p}_1^T, \mathbf{p}_2^T, \ldots,
\mathbf{p}_M^T]^T$, where N is the number of classes and M is the
number of classifiers. The outputs of the combiner are N different
scores representing the degree of support for each class. Let $r^n$ be
the combined score of class n and let $\mathbf{r} = [r^1, \ldots,
r^N]^T$; then in general the combiner is defined as a function
$g : \mathbb{R}^{MN} \to \mathbb{R}^N$ such that $\mathbf{r} =
g(\mathbf{f})$. Let I be the number of training data instances, let
$\mathbf{f}_i$ contain the scores for training data point i obtained
from the base classifiers with stacking, and let $y_i$ be the
corresponding class label; then our aim is to learn the function g
using the data $\{(\mathbf{f}_i, y_i)\}_{i=1}^{I}$. In the test phase,
the label of a data instance is assigned as follows:

$$\hat{y} = \arg\max_{n \in [N]} r^n, \qquad (1)$$

where $[N] = \{1, \ldots, N\}$. Among combination types, linear ones
have been shown to be powerful for the classifier combination problem.
For linear combiners, the function g has the following form:

$$g(\mathbf{f}) = \mathbf{W}\mathbf{f} + \mathbf{b}. \qquad (2)$$

In this case, we aim to learn the elements of $\mathbf{W} \in
\mathbb{R}^{N \times MN}$ and $\mathbf{b} \in \mathbb{R}^N$, so the
number of parameters to be learned is $MN^2 + N$. This type of
combination is the most general form of linear combiners and is called
type-3 combination in [17]. In the framework of stacking, we call it
linear stacked generalization (LSG) combination. One disadvantage of
this type of combination is that, since the number of parameters is
high, learning the combiner takes a lot of time and may require a
large amount of training data. To overcome this disadvantage, simpler
but still strong combiner types are introduced with the help of the
knowledge that $p_m^n$ is the posterior score of class n. We call
these methods the weighted sum (WS) rule and the class-dependent
weighted sum (CWS) rule. These
types are categorized as class-conscious combinations in 
0. 
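
As a small illustration of the linear combiner and the decision rule in this subsection, the sketch below evaluates the combined scores and picks the class with the largest one. The helper names `combine` and `predict`, the toy numbers, and the 0-based class indices are assumptions made for the example.

```python
def combine(W, b, f):
    """Return r = W f + b, with W given as N rows of length M * N."""
    return [sum(w * x for w, x in zip(row, f)) + bias
            for row, bias in zip(W, b)]


def predict(r):
    """Decision rule: index of the class with the largest combined score."""
    return max(range(len(r)), key=lambda n: r[n])


# M = 2 classifiers, N = 2 classes; f stacks p_1 and then p_2.
p1 = [0.7, 0.3]
p2 = [0.4, 0.6]
f = p1 + p2
W = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5]]
b = [0.0, 0.0]
r = combine(W, b, f)      # approximately [0.9, 0.6]
label = predict(r)        # class 0 in 0-based indexing
```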

3.2 Linear Combination Types 

In this section, we describe and analyze three combination types,
namely the weighted sum rule (WS), the class-dependent weighted sum
rule (CWS) and linear stacked generalization (LSG), where LSG is
already defined in (2).

3.2.1 Weighted Sum Rule

In this type of combination, each classifier is given a weight, so
there are M different weights in total. Let $u_m$ be the weight of
classifier m; then the final score of class n is estimated as follows:

$$r^n = \sum_{m=1}^{M} u_m p_m^n = \mathbf{u}^T \mathbf{f}^n, \quad n = 1, \ldots, N, \qquad (3)$$

where $\mathbf{f}^n$ contains the scores of class n, $\mathbf{f}^n =
[p_1^n, \ldots, p_M^n]^T$, and $\mathbf{u} = [u_1, \ldots, u_M]^T$. In
the framework given in (2), WS combination can be obtained by letting
$\mathbf{b} = \mathbf{0}$ and $\mathbf{W}$ be the concatenation of
constant diagonal matrices:

$$\mathbf{W} = [\, u_1 \mathbf{I}_N \mid \ldots \mid u_M \mathbf{I}_N \,], \qquad (4)$$

where $\mathbf{I}_N$ is the N x N identity matrix. We expect to obtain
higher weights for stronger base classifiers after learning the
weights from the database.
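
The equivalence between the weighted sum of scores and the block matrix form of W can be checked numerically. The following sketch uses hypothetical helper names (`ws_scores`, `ws_matrix`) and toy scores; it only illustrates the two formulas of this subsection.

```python
def ws_scores(u, P):
    """Weighted sum rule: r^n = sum_m u_m * p_m^n, with P[m][n] = p_m^n."""
    n_classes = len(P[0])
    return [sum(u[m] * P[m][n] for m in range(len(u)))
            for n in range(n_classes)]


def ws_matrix(u, n_classes):
    """W = [u_1 I_N | ... | u_M I_N] as N rows of length M * N."""
    return [[u[m] if col == row else 0.0
             for m in range(len(u)) for col in range(n_classes)]
            for row in range(n_classes)]


P = [[0.7, 0.2, 0.1],    # scores of classifier 1 for the three classes
     [0.5, 0.3, 0.2]]    # scores of classifier 2
u = [2.0, 1.0]

r = ws_scores(u, P)
W = ws_matrix(u, 3)
f = [s for p in P for s in p]                       # f = [p_1; p_2]
r_via_W = [sum(w * x for w, x in zip(row, f)) for row in W]
```

Both routes give the same scores; the matrix form is only useful for relating WS to the general linear combiner, since it stores M * N * N entries for M free parameters.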

3.2.2 Class-Dependent Weighted Sum Rule 

The performances of the base classifiers may differ for different
classes, and it may be better to use a different weight distribution
for each class. We call this type of combination the CWS rule. Let
$v_m^n$ be the weight of classifier m for class n; then the final
score of class n is estimated as follows:

$$r^n = \sum_{m=1}^{M} v_m^n p_m^n = \mathbf{v}_n^T \mathbf{f}^n, \quad n = 1, \ldots, N, \qquad (5)$$

where $\mathbf{v}_n = [v_1^n, \ldots, v_M^n]^T$. There are MN
parameters in a CWS combiner. In the framework given in (2), CWS
combination can be obtained by letting $\mathbf{b} = \mathbf{0}$ and
$\mathbf{W}$ be the concatenation of diagonal matrices; unlike in WS,
the diagonals are not constant:

$$\mathbf{W} = [\, \mathbf{W}_1 \mid \mathbf{W}_2 \mid \ldots \mid \mathbf{W}_M \,], \qquad (6)$$

where $\mathbf{W}_m \in \mathbb{R}^{N \times N}$ are diagonal for $m = 1, \ldots, M$.
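
A corresponding sketch for the class-dependent rule is given below. The names `cws_scores` and `cws_matrix` and the storage of the weights as a list of M rows (one row per classifier, one column per class) are assumptions made for the example.

```python
def cws_scores(V, P):
    """CWS rule: r^n = sum_m v_m^n * p_m^n, with V[m][n] and P[m][n]."""
    return [sum(V[m][n] * P[m][n] for m in range(len(P)))
            for n in range(len(P[0]))]


def cws_matrix(V):
    """W = [W_1 | ... | W_M] with W_m = diag(v_m^1, ..., v_m^N)."""
    n_classes = len(V[0])
    return [[V[m][col] if col == row else 0.0
             for m in range(len(V)) for col in range(n_classes)]
            for row in range(n_classes)]


P = [[0.6, 0.4],     # classifier 1
     [0.3, 0.7]]     # classifier 2
V = [[1.0, 0.0],     # classifier 1 is trusted for class 0 only
     [0.5, 2.0]]     # classifier 2 is trusted mostly for class 1

r = cws_scores(V, P)     # approximately [0.75, 1.4]
W = cws_matrix(V)
```

Setting every row of `V` to a constant value recovers the weighted sum rule as a special case.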

3.2.3 Linear Stacked Generalization 

This type of combination is the most general form of supervised
linear combinations and is already defined in (2). With LSG, the score
of class n is estimated as follows:

$$r^n = \mathbf{w}_n^T \mathbf{f} + b_n, \quad n = 1, \ldots, N, \qquad (7)$$

where $\mathbf{w}_n \in \mathbb{R}^{MN}$ is the n-th row of
$\mathbf{W}$ and $b_n$ is the n-th element of $\mathbf{b}$. LSG can be
interpreted as feeding the base classifiers' outputs to a linear
multi-class classifier as a new set of features. This type of
combination may overfit the database and may give lower accuracy than
WS and CWS combination when there is not enough data. From this point
of view, WS and CWS combination can be treated as regularized versions
of LSG. A crucial disadvantage of LSG is that the number of parameters
to be learned is $MN^2 + N$, which results in a long training period.

No single combination type is superior among these three, since
results have been shown to be data dependent [20]. A convenient way of
choosing the combination type is to select the one that gives the best
performance in cross-validation.
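
To make the difference in model size concrete, the parameter counts stated above (M for WS, MN for CWS and MN^2 + N for LSG) can be evaluated for a given ensemble. The function name `n_parameters` is hypothetical; the values M = 130 and N = 7 are taken from the diverse ensemble and the Segment dataset described later in the paper.

```python
def n_parameters(M, N):
    """Number of learned parameters for each linear combination type."""
    return {"WS": M, "CWS": M * N, "LSG": M * N * N + N}


counts = n_parameters(M=130, N=7)
# WS: 130, CWS: 910, LSG: 130 * 49 + 7 = 6377
```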
4 Learning the Combiner 

We use the regularized empirical risk minimization (RERM) framework
[18] for learning the weights. In this framework, learning is
formulated as an unconstrained minimization problem, and the objective
function consists of the empirical risk summed over the data instances
plus a regularization function. The empirical risk is obtained as a sum
of "loss" values obtained from each sample. Different choices of loss
functions and regularization functions correspond to different
classifiers. Using the hinge loss function with $\ell_2$ norm
regularization is equivalent to the support vector machine (SVM). It
has been shown in studies that the hinge loss function yields much
better classification performance than the least-squares (LS) loss
function in general. Earlier classifier combination literature uses
the LS loss function [6], [8], [13], which is suboptimal compared to
the hinge loss that we promote and use in this paper. Using
least-squares with $\ell_2$ regularization is equivalent to applying
the least-squares support vector machine (LS-SVM) [21]. We use an
adaptation of the SVM to multiclass problems defined in [22]. With this
adaptation, we find the linear separating hyperplane that maximizes
the margin between the true class and the most offending wrong class.
For LSG, we have the following objective function:

$$\phi_{LSG}(\mathbf{W}, \mathbf{b}) = \frac{1}{I} \sum_{i=1}^{I} \Big( 1 - r_i^{y_i} + \max_{n \neq y_i} r_i^n \Big)_+ + \lambda R_{LSG}(\mathbf{W}), \qquad (8)$$

where $R_{LSG}(\mathbf{W})$ is the regularization function, $(x)_+ =
\max(0, x)$, and the posterior score of data instance i for class n,
$r_i^n$, is given as follows:

$$r_i^n = \mathbf{w}_n^T \mathbf{f}_i + b_n. \qquad (9)$$

$\lambda \in \mathbb{R}$ in (8) is the regularization parameter, which
is usually learned by cross-validation. The objective function given in
(8) encourages the distance between the true class' score and the most
offending wrong class' score to be larger than one. A conventional
regularization function is the squared Frobenius norm of $\mathbf{W}$:

$$R_{LSG}(\mathbf{W}) = \|\mathbf{W}\|_F^2 = \sum_{n=1}^{N} \|\mathbf{w}_n\|_2^2. \qquad (10)$$
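
The LSG objective can be evaluated directly from its definition. The sketch below is an illustration only: the names `hinge_instance` and `lsg_objective` are hypothetical, the regularizer is assumed to be the sum of squared entries of W, and class labels are 0-based.

```python
def hinge_instance(r, y):
    """(1 - r^y + max over wrong classes n of r^n)_+ for one instance."""
    worst = max(score for n, score in enumerate(r) if n != y)
    return max(0.0, 1.0 - r[y] + worst)


def lsg_objective(W, b, F, Y, lam):
    """Mean multiclass hinge loss plus lam times the sum of squared weights."""
    total = 0.0
    for f, y in zip(F, Y):
        r = [sum(w * x for w, x in zip(row, f)) + bias
             for row, bias in zip(W, b)]
        total += hinge_instance(r, y)
    reg = sum(w * w for row in W for w in row)   # assumed regularizer
    return total / len(F) + lam * reg


# One base classifier (M = 1), two classes, W = identity, b = 0.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
F = [[0.9, 0.1], [0.4, 0.6]]     # level-1 scores of two instances
Y = [0, 0]                        # both instances belong to class 0
value = lsg_objective(W, b, F, Y, lam=0.1)
```

In this toy case the first instance is classified correctly but with a margin below one (loss 0.2) and the second is misclassified (loss 1.2), so the data term is 0.7 and the regularizer adds 0.1 * 2.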

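Equation (8) is given for LSG, but it can be modified for other types
of combinations using the unifying framework described in [20]. We
also give the objective functions for WS and CWS explicitly. The
objective function of WS is as follows:

$$\phi_{WS}(\mathbf{u}) = \frac{1}{I} \sum_{i=1}^{I} \Big( 1 - \mathbf{u}^T \mathbf{f}_i^{y_i} + \max_{n \neq y_i} \big( \mathbf{u}^T \mathbf{f}_i^n \big) \Big)_+ + \lambda R_{WS}(\mathbf{u}). \qquad (11)$$

For regularization, we use the $\ell_2$ norm of $\mathbf{u}$:
$R_{WS}(\mathbf{u}) = \|\mathbf{u}\|_2^2$. For CWS, we have the
following objective function:

$$\phi_{CWS}(\mathbf{V}) = \frac{1}{I} \sum_{i=1}^{I} \Big( 1 - \mathbf{v}_{y_i}^T \mathbf{f}_i^{y_i} + \max_{n \neq y_i} \big( \mathbf{v}_n^T \mathbf{f}_i^n \big) \Big)_+ + \lambda R_{CWS}(\mathbf{V}), \qquad (12)$$

where $\mathbf{V} \in \mathbb{R}^{M \times N}$ contains the weights
for the different classes: $\mathbf{V} = [\mathbf{v}_1, \ldots,
\mathbf{v}_N]$. As for LSG, the conventional regularization function
for CWS is the squared Frobenius norm of $\mathbf{V}$:
$R_{CWS}(\mathbf{V}) = \|\mathbf{V}\|_F^2$.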
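
The paper does not derive a solver for these objectives. Purely as an illustration of how the WS objective could be minimized, the sketch below runs plain subgradient descent, which is a swapped-in technique and not the method used by the authors. The names `ws_objective`, `ws_subgradient` and `train_ws`, the step-size schedule, the squared $\ell_2$ regularizer and the toy data are all assumptions.

```python
def ws_objective(u, F, Y, lam):
    """WS hinge objective; F[i][n] is the vector of class-n scores f_i^n."""
    total = 0.0
    for f, y in zip(F, Y):
        r = [sum(a * b for a, b in zip(u, fn)) for fn in f]
        worst = max(s for n, s in enumerate(r) if n != y)
        total += max(0.0, 1.0 - r[y] + worst)
    return total / len(F) + lam * sum(a * a for a in u)


def ws_subgradient(u, F, Y, lam):
    """One subgradient of ws_objective at u."""
    g = [2.0 * lam * a for a in u]
    for f, y in zip(F, Y):
        r = [sum(a * b for a, b in zip(u, fn)) for fn in f]
        n_star = max((n for n in range(len(r)) if n != y), key=lambda n: r[n])
        if 1.0 - r[y] + r[n_star] > 0.0:          # hinge is active
            for m in range(len(u)):
                g[m] += (f[n_star][m] - f[y][m]) / len(F)
    return g


def train_ws(F, Y, lam, n_steps=200):
    """Subgradient descent with a decaying step; returns the best iterate."""
    u = [0.0] * len(F[0][0])
    best_u, best_obj = list(u), ws_objective(u, F, Y, lam)
    for t in range(n_steps):
        g = ws_subgradient(u, F, Y, lam)
        step = 0.5 / (t + 1) ** 0.5
        u = [a - step * b for a, b in zip(u, g)]
        obj = ws_objective(u, F, Y, lam)
        if obj < best_obj:
            best_u, best_obj = list(u), obj
    return best_u, best_obj


# Two classifiers, two classes; classifier 1 is the more reliable one.
F = [[[0.9, 0.4], [0.1, 0.6]],
     [[0.2, 0.5], [0.8, 0.5]],
     [[0.7, 0.6], [0.3, 0.4]],
     [[0.1, 0.7], [0.9, 0.3]]]
Y = [0, 1, 0, 1]
g0 = ws_subgradient([0.0, 0.0], F, Y, 0.01)
best_u, best_obj = train_ws(F, Y, lam=0.01)
```

The hinge objective is convex but not differentiable everywhere, which is why the sketch keeps the best iterate seen instead of the last one.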



5 Sparse Regularization 

In this section, we define a set of regularization functions that
enforce sparsity on the weights, so that the resulting combiner does
not use all the base classifiers, which leads to a shorter test time.
This method can be seen as a classifier selection algorithm, but here
the classifiers are selected automatically and we cannot determine the
number of selected classifiers beforehand. We can, however, lower
this number by increasing the weight of the regularization function
($\lambda$), and vice versa. With sparse regularization, $\lambda$ has
two main effects on the resulting combiner. First, it determines how
closely the combiner fits the data. Decreasing $\lambda$ makes the
combiner fit the training data more closely, and decreasing it too
much results in overfitting; on the other hand, increasing it too much
prevents the combiner from learning from the data, and the accuracy
drops dramatically. Second, as mentioned before, it determines the
number of selected classifiers: as $\lambda$ increases, the number of
selected classifiers decreases.



5.1 Regularization with the $\ell_1$ Norm

The most successful approach for inducing sparsity is using the
$\ell_1$ norm of the weight vector for WS. For CWS and LSG, in which
the combiner consists of matrices, we can concatenate the weights in a
vector and take the $\ell_1$ norm, or equivalently we can take the
$\ell_1$-$\ell_1$ norm of the weight matrices. We have the following
sparse regularization functions for WS, CWS and LSG, respectively:

$$R_{WS}(\mathbf{u}) = \|\mathbf{u}\|_1, \qquad (13)$$

$$R_{CWS}(\mathbf{V}) = \|\mathbf{V}\|_{1,1} = \sum_{n=1}^{N} \|\mathbf{v}_n\|_1, \qquad (14)$$

$$R_{LSG}(\mathbf{W}) = \|\mathbf{W}\|_{1,1} = \sum_{n=1}^{N} \|\mathbf{w}_n\|_1. \qquad (15)$$

If all weights of a classifier are zero, that classifier is eliminated
and we do not have to evaluate it for a test instance, so testing is
faster. The problem with $\ell_1$ norm regularization for CWS and LSG
is that we are not able to use all the information from a selected
base classifier, because a classifier may receive both zero and
non-zero weights. To overcome this problem, we propose to use group
sparsity, as explained in the next section.
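
The selection effect described above can be illustrated on a CWS weight matrix. In this sketch the names `l11_norm` and `selected_classifiers` are hypothetical, and the weights are stored as one row per classifier; a classifier is dropped only when its whole row is zero.

```python
def l11_norm(V):
    """Sum of the absolute values of all weights."""
    return sum(abs(w) for row in V for w in row)


def selected_classifiers(V, tol=0.0):
    """Indices of classifiers that keep at least one non-zero weight."""
    return [m for m, row in enumerate(V) if any(abs(w) > tol for w in row)]


V = [[0.0, 0.0, 0.0],    # classifier 0: eliminated
     [0.8, 0.0, 0.0],    # classifier 1: kept, but only one score is used
     [0.3, 0.5, 0.2]]    # classifier 2: kept, all scores are used

penalty = l11_norm(V)              # approximately 1.8
kept = selected_classifiers(V)     # [1, 2]
```

Classifier 1 shows the drawback discussed in the text: it still has to be evaluated at test time although two of its three scores are ignored.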
5.2 Regularization with Group Sparsity 

We define another set of regularization functions, based on group
sparsity, for LSG and CWS to enforce classifier selection. The main
principle of group sparsity is to force all elements that belong to a
group to be zero together. The grouping of the elements is done before
learning. In classifier combination, the posterior scores obtained
from each base classifier form a group. The following regularization
function yields group sparsity for LSG:

$$R_{LSG}(\mathbf{W}) = \sum_{m=1}^{M} \|\mathbf{W}_m\|_F, \qquad (16)$$

where $\mathbf{W}_m$ is the N x N block of $\mathbf{W}$ that
multiplies the scores of classifier m, with $\mathbf{W}$ partitioned
as in (6). For CWS, we use the following regularization:

$$R_{CWS}(\mathbf{V}) = \|\mathbf{V}\|_{1,2} = \sum_{m=1}^{M} \|\mathbf{v}^m\|_2, \qquad (17)$$

where $\mathbf{v}^m$ is the m-th row of $\mathbf{V}$, so it contains
the weights of classifier m. After the learning process, the elements
of $\mathbf{v}^m$ for any m are either all zero or all non-zero. This
leads to better performance than $\ell_1$ regularization for automatic
classifier selection, as we show in Section 7. In the next section, we
describe the setup of the experiments.
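
The all-or-nothing behaviour of the group penalty can be illustrated with group soft-thresholding, the proximal operator commonly associated with a sum of $\ell_2$ norms. This operator is used here only as an illustration and is not the solver used in the paper; the names `l12_norm` and `group_soft_threshold` and the threshold `tau` are assumptions.

```python
import math


def l12_norm(V):
    """Sum over classifiers (rows of V) of the l2 norm of their weights."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in V)


def group_soft_threshold(V, tau):
    """Shrink every row towards zero; rows with norm <= tau vanish entirely."""
    out = []
    for row in V:
        norm = math.sqrt(sum(w * w for w in row))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0.0 else 0.0
        out.append([scale * w for w in row])
    return out


V = [[0.3, 0.4],     # weak classifier, row norm 0.5
     [3.0, 4.0]]     # strong classifier, row norm 5.0

penalty = l12_norm(V)                      # approximately 5.5
shrunk = group_soft_threshold(V, tau=1.0)
# The first row becomes exactly [0.0, 0.0]; the second is scaled by 0.8.
```

Because the shrinkage acts on a whole row at once, a classifier either loses all of its weights or keeps all of them, which matches the behaviour stated for the group sparse regularizers.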

6 Experimental Setups 

We have performed extensive experiments on eight real-world datasets
from the UCI repository [23]. For a summary of the characteristics of
the datasets, see Table 1. In order to obtain statistically
significant results, we applied 5x2 cross-validation [24], which is
based on 5 iterations of 2-fold cross-validation (CV). In this method,
for each CV, the data is randomly split into two stacks used for
training and testing, resulting in 10 stacks overall for each
database.
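
The 5x2 CV protocol described above can be sketched as follows. The function name `five_by_two_splits`, the seed handling and the equal-halves split are assumptions for illustration; the paper does not specify how the random splits were drawn.

```python
import random


def five_by_two_splits(n_instances, seed=0):
    """Return 10 (train, test) index pairs: 5 random halvings, used both ways."""
    rng = random.Random(seed)
    splits = []
    for _ in range(5):
        idx = list(range(n_instances))
        rng.shuffle(idx)
        half = n_instances // 2
        a, b = idx[:half], idx[half:]
        splits.append((a, b))    # train on the first half, test on the second
        splits.append((b, a))    # then swap the roles
    return splits


splits = five_by_two_splits(10, seed=1)
```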

We constructed two ensembles which differ in their diversity. In the
first ensemble, we randomly construct 10 different subsets, each
containing 80% of the original data. Then, 13 different classifiers
are trained with each subset, resulting in a total of 130 base
classifiers. We used PR-Tools [25] and the Libsvm toolbox [26] for
obtaining the base classifiers. These 13 different classifiers are:
normal densities based linear classifier, normal densities based
quadratic classifier, nearest mean classifier, k-nearest neighbor
classifier, polynomial classifier, general kernel/dissimilarity based
classification, normal densities based classifier with independent
features, Parzen classifier, binary decision tree classifier, linear
perceptron, and SVM with linear kernel, polynomial kernel and radial
basis function (RBF) kernel. We used the default parameters of the
toolboxes. In the second ensemble setup, we trained a total of 154
SVMs with different kernel functions and parameters. The latter method
produces less diverse base classifiers than the former one. The
training data of the combiner is obtained by 4-fold stacked
generalization. For each stack in 5x2 CV, 2-fold CV is used to obtain
the optimal $\lambda$ in the regularization function, i.e. the
$\lambda$ which gives the best average accuracy in CV.¹ For the
minimization of the objective functions, we used the CVX toolbox [27].
We use the Wilcoxon signed-rank test for identifying the statistical
significance of the results at the one-tailed significance level
$\alpha = 0.05$ [28].
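
Selecting the regularization parameter by cross-validated accuracy can be sketched as below. The grid is the one listed in the paper's footnote on the search range; the function name `select_lambda`, the callback `fold_accuracies` and the toy accuracies are hypothetical stand-ins for training and scoring a combiner on each fold.

```python
LAMBDA_GRID = [1e-11, 1e-9, 1e-7, 1e-5, 1e-3,
               0.005, 0.01, 0.05, 0.1, 0.5, 1, 10]


def select_lambda(grid, fold_accuracies):
    """Pick the grid value with the best mean accuracy over the CV folds.

    fold_accuracies(lam) must return one accuracy per fold.
    Ties are resolved in favour of the earlier grid value.
    """
    def mean_accuracy(lam):
        accs = fold_accuracies(lam)
        return sum(accs) / len(accs)
    return max(grid, key=mean_accuracy)


# Hypothetical 2-fold accuracies standing in for real combiner training.
toy = {lam: [0.80, 0.82] for lam in LAMBDA_GRID}
toy[0.01] = [0.95, 0.93]
best = select_lambda(LAMBDA_GRID, lambda lam: toy[lam])
```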

TABLE 1: Properties of the data sets used in the experiments

DB          # of instances   # of classes   # of features
Segment²    2310             7              19
Waveform³   5000             3              21
Robot⁴      5456             4              24
Statlog⁵    846              4              18
Vowel⁶      990              11             10
Wine        178              3              13
Yeast       1484             9              8
Steel⁷      1941             7              27



7 Results 

First, we investigate the performance of regularized learning of the
weights with the hinge loss compared to the conventional least squares
loss [13] and the multi-response linear regression (MLR) method, which
does not contain regularization [6], with the diverse ensemble setup
described in Section 6. It should be noted that the results shown here
and in [13], [6] are not directly comparable since the construction of
the ensembles is different. Error percentages of these three different
learning algorithms for WS, CWS and LSG are given in Table 2. Results
for the simple sum rule, which is equivalent to using equal weights in
WS, are also given in the column titled EW. The first entries in the
cells are the means of the error percentages over the 5x2 CV stacks
and the second entries are the standard deviations. For five datasets,
the lowest error means are obtained with the hinge loss function, and
for two datasets the lowest error means are obtained with the
least-squares loss function. On the Yeast dataset, simple averaging
works better than the supervised learners. On all datasets, the MLR
method results in higher error percentages compared to the other
methods, which shows the power of regularized learning, especially if
the number of base classifiers is high. It should be noted that in
[6], 3 base classifiers are used, whereas here we use 130 base
classifiers. According to the pairwise Wilcoxon signed-rank test [28],
the hinge loss function outperforms the least squares loss function at
the one-tailed significance level $\alpha = 0.05$ for the WS and CWS
combination types and at $\alpha = 0.0525$ for LSG combination.

1. We searched for $\lambda$ in $\{10^{-11}, 10^{-9}, 10^{-7}, 10^{-5}, 10^{-3}, 0.005,
0.01, 0.05, 0.1, 0.5, 1, 10\}$

2. The full name of Segment dataset is "Image Segmentation" 

3. The full name of Waveform dataset is "Waveform Database Generator 
(Version 1)" 

4. The full name of Robot dataset is "Wall-Following Robot Navigation 
Data" 

5. The full name of Statlog dataset is "Statlog (Vehicle Silhouettes)" 

6. The full name of Vowel dataset is "Connectionist Bench (Vowel 
Recognition - Deterding Data)" 

7. The full name of Steel dataset is "Steel Plates Faults" 
TABLE 2: Error percentages in the diverse ensemble setup (mean ± standard deviation).

          Hinge loss with l2 regularization               Least squares loss with l2 regularization       MLR
DB        WS              CWS             LSG             WS              CWS             LSG             WS              CWS             LSG             EW
Segment   5.02 ± 0.88     3.59 ± 0.96     3.44 ± 0.61     6.44 ± 0.75     5.69 ± 0.89     3.59 ± 0.76     7.20 ± 1.02     6.66 ± 6.64     61.28 ± 9.35    7.37 ± 1.03
Waveform  13.20 ± 0.69    13.08 ± 0.76    13.05 ± 0.65    13.15 ± 0.71    13.30 ± 0.76    13.15 ± 0.75    13.33 ± 0.68    14.10 ± 0.56    18.40 ± 7.06    14.17 ± 0.60
Robot     3.95 ± 0.42     2.53 ± 0.28     2.61 ± 0.28     5.21 ± 0.58     2.52 ± 0.30     2.50 ± 0.27     5.05 ± 0.62     2.58 ± 0.30     3.19 ± 0.49     18.58 ± 0.61
Statlog   16.34 ± 1.15    16.12 ± 1.94    16.16 ± 1.67    17.64 ± 1.65    16.90 ± 1.89    16.76 ± 1.64    17.73 ± 2.11    58.01 ± 15.38   75.72 ± 6.18    23.03 ± 2.33
Vowel     13.84 ± 2.73    6.79 ± 1.31     6.30 ± 1.99     13.98 ± 2.64    6.55 ± 2.20     6.55 ± 1.85     17.15 ± 2.31    10.08 ± 1.75    9.76 ± 1.14     14.53 ± 3.30
Wine      2.13 ± 1.54     1.69 ± 1.52     1.91 ± 1.76     2.70 ± 2.13     2.25 ± 1.83     6.85 ± 17.36    3.71 ± 2.31     8.20 ± 16.19    2.47 ± 1.66     2.81 ± 1.52
Yeast     40.36 ± 1.21    40.63 ± 1.21    40.70 ± 1.68    46.15 ± 18.76   40.42 ± 1.03    41.36 ± 1.32    41.05 ± 1.04    53.11 ± 6.88    74.45 ± 6.42    40.26 ± 1.10
Steel     29.85 ± 1.86    27.37 ± 1.18    27.41 ± 1.22    31.06 ± 1.85    27.36 ± 1.17    28.03 ± 2.77    30.35 ± 1.34    51.40 ± 14.66   77.12 ± 7.82    31.57 ± 2.07
MEAN      15.59 ± 1.31    13.97 ± 1.15    13.97 ± 1.23    17.04 ± 3.63    14.12 ± 1.26    14.85 ± 3.34    16.94 ± 1.43    25.52 ± 7.80    40.30 ± 5.01    19.04 ± 1.57




Fig. 1: Accuracy and number of selected classifiers vs. $\lambda$ for
WS combination of Robot data with the diverse ensemble setup (plotted
series: $\ell_2$ accuracy, $\ell_1$ accuracy, $\ell_1$ # of
classifiers; $\lambda$ axis from 1.00E-06 to 1.00E-01)

Fig. 2: Accuracy and number of selected classifiers vs. $\lambda$ for
CWS combination of Robot data with the diverse ensemble setup

Fig. 3: Accuracy and number of selected classifiers vs. $\lambda$ for
LSG combination of Robot data with the diverse ensemble setup



We also investigated the performance of sparse regularization with
the hinge loss function. We used the two different ensemble setups
described in Section 6. The regularization parameter $\lambda$ in the
objective functions (8), (11) and (12) is an important parameter, and
if we minimize the objective functions also over $\lambda$, the
combiner will overfit the training data, which will result in poor
generalization performance. Therefore, we used 2-fold cross-validation
to learn the optimal parameter. We plot the relation of $\lambda$ to
the accuracies and the number of selected classifiers for different
regularizations with WS, CWS and LSG for the Robot dataset in Figures
1, 2 and 3, respectively. In these figures, dashed lines correspond to
the number of selected classifiers and solid lines correspond to
accuracies. The $\ell_1$-$\ell_2$ label represents group sparsity. In
all sparse regularizations, the best accuracies are obtained when most
of the base classifiers are eliminated. For all regularizations, the
accuracies peak at $\lambda$ values between 0.001 and 0.1. For
$\ell_1$ norm regularization, the accuracies drop dramatically with a
small increase in $\lambda$. With group sparsity regularization,
however, the accuracies remain high over a larger range of $\lambda$
than with $\ell_1$ norm regularization. Thus the performance of
$\ell_1$ regularization is more sensitive to the selection of
$\lambda$, and we can say that $\ell_1$-$\ell_2$ norm regularization
is more robust than $\ell_1$ norm regularization. As the number of
selected classifiers decreases, the accuracies increase in general,
but this increase in accuracy cannot be attributed to classifier
selection, because $\lambda$ also determines how closely the combiner
fits the data, as discussed in Section 5.

Next, we show the test results for all combination types with various
regularization functions. Error percentages (mean ± standard
deviation) are shown in Table 3 for the diverse ensemble setup, and
the corresponding numbers of selected classifiers are shown in Table
4.

In general, we are able to use far fewer base classifiers with sparse
regularization at the cost of a small decrease in accuracy. For CWS,
group sparsity regularization outperforms $\ell_1$ norm regularization
at the one-tailed significance level $\alpha = 0.005$. For LSG, the
average error percentage of group sparsity is slightly lower than
that of $\ell_1$ norm regularization, a difference which is not
statistically significant, but the number of selected base classifiers
is much smaller. So if classifier selection is desired, we suggest
TABLE 3: Error percentages with the diverse ensemble setup (mean ± standard deviation). Bold values are the lowest
error percentages of sparse regularizations ($\ell_1$ or $\ell_1$-$\ell_2$ regularizations)

          WS                              CWS                                             LSG
DB        l2              l1              l2              l1              l1-l2           l2              l1              l1-l2           EW
Segment   5.02 ± 0.88     4.90 ± 0.99     3.59 ± 0.96     3.62 ± 0.62     3.74 ± 0.40     3.44 ± 0.61     3.79 ± 1.05     3.29 ± 0.55     7.37 ± 1.03
Waveform  13.20 ± 0.69    13.38 ± 0.70    13.08 ± 0.76    13.46 ± 0.74    13.42 ± 0.76    13.05 ± 0.65    13.33 ± 0.71    13.24 ± 0.64    14.17 ± 0.60
Robot     3.95 ± 0.42     4.00 ± 0.38     2.53 ± 0.28     2.57 ± 0.35     2.49 ± 0.33     2.61 ± 0.28     2.54 ± 0.35     2.52 ± 0.32     18.58 ± 0.61
Statlog   16.34 ± 1.15    17.19 ± 1.63    16.12 ± 1.94    17.45 ± 1.74    17.33 ± 1.42    16.36 ± 1.67    17.40 ± 1.34    17.45 ± 1.51    23.03 ± 2.33
Vowel     13.84 ± 2.73    14.40 ± 2.27    6.79 ± 1.31     7.62 ± 2.02     7.17 ± 1.50     6.30 ± 1.99     6.18 ± 1.19     6.79 ± 1.17     14.53 ± 3.30
Wine      2.13 ± 1.54     2.13 ± 1.63     1.69 ± 1.52     2.25 ± 1.18     1.91 ± 1.30     1.91 ± 1.76     2.25 ± 1.59     2.36 ± 1.54     2.81 ± 1.52
Yeast     40.36 ± 1.21    40.38 ± 1.06    40.63 ± 1.21    42.53 ± 1.42    41.19 ± 1.57    40.70 ± 1.68    48.09 ± 18.30   41.67 ± 1.31    40.26 ± 1.10
Steel     29.85 ± 1.86    30.00 ± 2.61    27.37 ± 1.18    28.31 ± 1.39    27.41 ± 1.21    27.41 ± 1.22    28.09 ± 1.03    27.50 ± 1.24    31.57 ± 2.07
MEAN      15.59 ± 1.31    15.80 ± 1.41    13.97 ± 1.15    14.73 ± 1.56    14.33 ± 1.06    13.97 ± 1.23    15.21 ± 3.20    14.35 ± 1.04    19.04 ± 1.57



TABLE 4: Number of selected classifiers with the diverse ensemble setup out of 130 (mean ± standard deviation). 



DB 


WS 
h 


CWS 

!i | Zi - / 2 


LSG 

h | h-h 


Segment 


21.50 ± 4.62 


63.50 : 25.72 


30.80 : 34.92 


97.40 1 24.40 


80.40 1 14.93 


Waveform 


36.60 ! 19. 11 


23.30 : 37.59 


47.00 ± 57.31 


11.20 it 2.30 


12.10 ± 5.38 


Robot 


41.80 ± 9.02 


18.60 ± 5.97 


14.00 ± 4.55 


18.50 ± 4.53 


13.30 ± 2.63 


Statlog 


36.10 ± 34.75 


14.30 ± 10.85 


49.20 ± 56.13 


30.60 ± 36.31 


11.20 ± 12.42 


Vowel 


108.90 ± 44.48 


37.80 =: 32.62 


57.30 ± 62.64 


128.00 :: 6.32 


13.80 ± 3.99 


Wine 


130.00 ± 0.00 


121.30 ± 18.60 


117.10 ± 40.44 


93.50 -1 5S.86 


91.60 ± 61.83 


Yeast 


119.10 ± 34.47 


121.00 ± 28.46 


40.40 ± 47.33 


130.00 + 0.00 


9.80 ± 3.46 


Steel 


41.90 ± 32.05 


42.10 ± 6.85 


35.30 ± 8.10 


51.00 i. 16.62 


35.20 ± 11.93 



MEAN | 66.99 I 26.10 | 55.24 : 20.S3 | 43.89 : l.s.u ; | jiin I [,s.n7 | 33.43 ± 14.57 ] 



to use either CWS or LSG combination with h — h 
regularization. If training time is also crucial, CWS with 
h — h regularization seems to be the best option. 

Error percentages and numbers of selected classifiers for
the non-diverse ensembles are given in Tables 5 and 6,
respectively. With the non-diverse ensembles, we are even
able to increase the accuracy while using far fewer base
classifiers with sparse regularization in CWS and LSG. On
average, l_1-l_2 regularization results in lower error
percentages for both CWS and LSG, but the differences are
not statistically significant. However, the number of
selected classifiers is much smaller with l_1-l_2
regularization than with l_1 regularization. Except for the
Statlog dataset, the lowest error percentages are obtained
with the sparse combinations, which use far fewer base
classifiers than l_2 regularization, which uses 154 base
classifiers. If we compare different combination types with
the l_2 norm, we see that, on average and unlike in the
diverse ensemble setup, WS and CWS outperform LSG in four
databases. We can conclude that if the posterior scores
obtained from base classifiers are correlated, less complex
combiners are more powerful, since complex combiners may
result in overfitting.
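
To make the notion of combiner complexity concrete, the following sketch counts the free weights of each combination type. It assumes the usual forms of the three combiners: WS uses one weight per base classifier, CWS one weight per classifier and class, and LSG one weight per classifier, input class and output class. Bias terms are omitted, all function names are hypothetical, and N = 3 classes is chosen only for illustration; M = 154 is the ensemble size mentioned above.

```python
def ws_scores(x, u):
    """WS: x[m][n] is the score of classifier m for class n; u[m] is its weight."""
    M, N = len(x), len(x[0])
    return [sum(u[m] * x[m][n] for m in range(M)) for n in range(N)]

def cws_scores(x, v):
    """CWS: v[m][n] weights the score of classifier m for class n."""
    M, N = len(x), len(x[0])
    return [sum(v[m][n] * x[m][n] for m in range(M)) for n in range(N)]

def lsg_scores(x, w):
    """LSG: w[n][m][k] weights classifier m's score for class k in output n."""
    M, N = len(x), len(x[0])
    return [sum(w[n][m][k] * x[m][k] for m in range(M) for k in range(N))
            for n in range(N)]

def num_params(M, N):
    """Number of free weights per combination type (bias terms omitted)."""
    return {"WS": M, "CWS": M * N, "LSG": M * N * N}

def ws_as_cws(u, N):
    """Embed WS weights into CWS form by repeating each weight for every class."""
    return [[u_m] * N for u_m in u]

def cws_as_lsg(v):
    """Embed CWS weights into LSG form by zeroing all cross-class weights."""
    M, N = len(v), len(v[0])
    return [[[v[m][n] if k == n else 0.0 for k in range(N)]
             for m in range(M)] for n in range(N)]

# Toy scores from 2 base classifiers on a 2-class problem (hypothetical values).
x = [[0.7, 0.3], [0.4, 0.6]]
u = [0.5, 0.5]
v = ws_as_cws(u, 2)
print(ws_scores(x, u), cws_scores(x, v), lsg_scores(x, cws_as_lsg(v)))
print(num_params(154, 3))
```

Under these assumptions, every WS combiner is a special case of CWS, and every CWS combiner is a special case of LSG. With correlated base classifiers, the extra weights of LSG give it more room to fit the training data without adding new information, which is consistent with the overfitting explanation above.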

8 Conclusion 

In this paper, we suggested using the hinge loss function
with regularization to learn the parameters (or weights) of
linear combiners in stacked generalization. We are able to
obtain better accuracies with the hinge loss function than
with conventional least-squares estimation of the weights.
Results also indicate the importance of regularized learning
of the weights. We also proposed l_1-l_2 norm regularization
(or group sparsity) to obtain a reduced number of base
classifiers so that the test time is shortened. Results
indicate that, with the diverse ensemble, we can use a
smaller number of base classifiers with a small sacrifice in
accuracy. We show that l_1-l_2 regularization outperforms
l_1 regularization in terms of both accuracy and the number
of selected classifiers. With the non-diverse ensemble
setup, we even obtain better accuracies using sparse
regularizations. If training time is crucial, we suggest
using CWS type combination, and if test time is also
important, we suggest using group sparsity regularization.
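
As a compact illustration of the learning criterion summarized above, the following sketch evaluates a regularized empirical risk with a hinge loss. It assumes a multiclass hinge loss of the Crammer-Singer type (a margin of 1 between the true class score and the best rival score) and an objective of the form average loss plus λ times a penalty. The function names and the toy scores are hypothetical, and the penalty value is passed in directly rather than computed from weights.

```python
def multiclass_hinge(scores, label):
    """Hinge loss: zero once the true class beats the best rival by a margin of 1."""
    rival = max(s for n, s in enumerate(scores) if n != label)
    return max(0.0, 1.0 - scores[label] + rival)

def regularized_risk(score_rows, labels, penalty, lam):
    """Average hinge loss over the samples plus lam times the penalty value."""
    losses = [multiclass_hinge(s, y) for s, y in zip(score_rows, labels)]
    return sum(losses) / len(losses) + lam * penalty

# Two hypothetical samples, both of class 0, scored by some combiner.
score_rows = [[2.0, 0.5, 0.1],   # margin satisfied, so no loss
              [0.6, 0.5, 0.0]]   # correct class wins, but margin is violated
labels = [0, 0]

print(multiclass_hinge(score_rows[0], 0))
print(multiclass_hinge(score_rows[1], 0))
print(regularized_risk(score_rows, labels, penalty=2.0, lam=0.01))
```

In this sketch a larger λ puts more emphasis on keeping the penalty small and less on fitting the training samples, which is the trade-off that the choice of λ controls in the experiments.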